Entropy-Based Variational Scheme with Component Splitting for the Efficient Learning of Gamma Mixtures

Finite Gamma mixture models have proved to be flexible and can take prior information into account to improve generalization capability, which makes them interesting for several machine learning and data mining applications. In this study, an efficient Gamma mixture model-based approach for proportional vector clustering is proposed. In particular, a sophisticated entropy-based variational algorithm is developed to learn the model and optimize its complexity simultaneously. Moreover, a component-splitting principle is investigated to handle the problem of model selection and to prevent over-fitting, which is an added advantage, as it is done within the variational framework. The performance and merits of the proposed framework are evaluated on several challenging real-world applications, including dynamic texture clustering, object categorization and human gesture recognition.


Introduction
The amount of multimedia data available in the world is increasing at an astounding rate. Analyzing these heterogeneous and multimodal data automatically and extracting knowledge instantly through machine learning techniques has become a substantial problem for various decision-making fields. Among the most used techniques are clustering, recognition and classification [1]. For instance, image classification research focuses on seeking effective image representation that can be utilized to categorize images into different categories and then to learn patterns in these classes. Pattern recognition is often applied in the new-age technical sectors, such as human gesture recognition, face identification, speech recognition and so on. Furthermore, clustering techniques aim at grouping items having the same features and this process can assist businesses, for instance, in identifying separate groups within their customer base. These problems have great practical applications in multimedia information retrieval, machine learning, data security and pattern recognition, to name a few. Much research in various decision-making fields and many real-life computer vision applications has been conducted that focuses on finding efficient algorithms to analyze data accurately. Although much research has been carried out, the obtained performance is far from reliable and leaves these issues open, to a large extent, for further investigation. Indeed, how to build an accurate model of high-dimensional data in a compact and reliable way is one of the most difficult issues.
The knowledge of the statistical properties of the data has a crucial role in the majority of applications. Among the main developed methods in this context, finite mixture models have been broadly adopted, thanks to their flexibility [2][3][4][5][6][7][8]. The basic idea is to assume that the data can be represented by a mixture of distributions, of which we then need to estimate the parameters. For instance, the Gaussian mixture model (GMM) has shown its effectiveness in many applications, due to its simplicity in data modeling [9][10][11].
However, when dealing with mixture models, we face the following challenging issues: (i) selecting a flexible distribution that well-describes and fits complex (non-Gaussian) shapes; (ii) accurately estimating the parameters of the probabilistic model; and, finally, (iii) defining the appropriate number of clusters (known also as studying the model complexity). Furthermore, in many cases, complex data cannot be represented by simple Gaussian distributions.
To deal with conventional GMM limitations, many other alternatives have been proposed. Examples include the Gamma (GaM) mixture, which has been shown to fit different types of data and to provide better results than GMM [12][13][14][15] thanks to its long-tailed distributions. When learning statistical mixture models, the most common estimation algorithm is expectation maximization (EM), which is based on the maximum likelihood estimator (MLE) [2,16]. Nevertheless, this estimator depends on initialization, may converge to a local maximum instead of the global one, and can result in wrong parameter estimates. To overcome such issues, an alternative is to use a pure Bayesian approach, such as Markov chain Monte Carlo (MCMC) [17,18], which has proven to be more effective than MLE, but it is also computationally intensive, and convergence is not always guaranteed. Consequently, to profit from the merits of both pure Bayesian and MLE techniques and avoid their drawbacks, variational Bayes approaches have been proposed as effective alternatives [19][20][21]. In particular, variational approaches are more controllable, are less costly in computation than MCMC, and can efficiently address the problems of overfitting and parameter estimation. The basic idea is to determine the optimal approximation by minimizing, for instance, the Kullback-Leibler (KL) divergence (i.e., a measure of the difference between the approximated posterior distribution and the true one) [22].
To accurately determine the number of mixture components (a problem known as model complexity) when dealing with mixture models, some studies have considered different criteria, such as minimum message length (MML) and minimum description length (MDL) [6]. In other works, the so-called component-splitting criterion has been investigated [9]. The main idea is to begin with two components (clusters) and to then progressively add more components by splitting existing ones. For example, in [23], entropy measures are computed and investigated via a variational learning framework to split the components of Gaussian mixture models.
The objective of our work is to investigate the modeling capabilities of Gamma mixtures and to develop a variational approach to learning finite Gamma mixture models. Moreover, we go a step further by incorporating an entropy metric and a component-splitting approach to handle the model selection and parameter estimation problems simultaneously. An added advantage is that model selection is carried out within the entropy-based variational framework itself. Thus, it is possible to automatically select the optimal number of clusters, to learn the model's parameters efficiently and to overcome the problem of under-fitting. The merits of the proposed framework are demonstrated through some challenging applications involving dynamic texture clustering, object categorization and human gesture recognition.
The rest of the paper is organized in the following manner. In Section 2 we introduce the finite Gamma mixture model with local model selection. In Section 3, the details of our variational Bayes learning framework via entropy-based splitting are described. Section 4 is devoted to reporting the obtained results, which are based on several challenging applications, to verify the merits and effectiveness of our framework, and Section 5 concludes the paper.

The Statistical Model
In this section, a brief description of finite Gamma mixture modeling is presented; then we introduce the mixture model with local model selection using a component-splitting approach. The motivation for choosing Gamma mixtures is mainly their flexibility in terms of modeling non-Gaussian and complex shapes, and also their ease of use.

Finite Gamma Mixture Model
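Let us suppose we have a data set $\mathcal{Y} = \{\vec{Y}_1, \ldots, \vec{Y}_N\}$ of $N$ data instances (i.e., feature vectors), where each $\vec{Y}_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iD})$ is a $D$-dimensional positive vector that can be modeled using a Gamma distribution:

$$p(\vec{Y}_i \mid \theta) = \prod_{d=1}^{D} \frac{\beta_d^{\alpha_d}}{\Gamma(\alpha_d)}\, Y_{id}^{\alpha_d - 1}\, e^{-\beta_d Y_{id}}$$

where $\vec{Y}_i$ ($i = 1, \ldots, N$) satisfies $0 \le Y_{id}$ for $d = 1, \ldots, D$; $\alpha_d$ is the shape and $\beta_d$ the rate parameter of this distribution (here $\theta = \{\alpha_d, \beta_d\}$). The function $\Gamma(\cdot)$ is defined as:

$$\Gamma(t) = \int_{0}^{\infty} x^{t-1} e^{-x}\, dx$$

If $\vec{Y}_i$ is distributed according to a mixture of Gamma distributions with $M$ components, then we have

$$p(\vec{Y}_i \mid \Theta) = \sum_{j=1}^{M} \pi_j\, p(\vec{Y}_i \mid \theta_j)$$

where $\pi_j$ denotes the mixing coefficients, with the constraints $0 \le \pi_j \le 1$ and $\sum_{j=1}^{M} \pi_j = 1$. $\Theta = \{\theta_1, \theta_2, \ldots, \theta_M, \pi_1, \ldots, \pi_M\}$ and $\theta_j = \{\alpha_{jd}, \beta_{jd}\}$ is the set of parameters of the $j$-th mixture component. We now introduce an indicator matrix $Z = (\vec{Z}_1, \ldots, \vec{Z}_N)$ which indicates to which component each data sample is assigned. Here $\vec{Z}_i = (Z_{i1}, \ldots, Z_{iM})$ is a binary vector that satisfies the conditions $Z_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{M} Z_{ij} = 1$, such that $Z_{ij} = 1$ if $\vec{Y}_i$ belongs to component $j$ and $Z_{ij} = 0$ otherwise. The conditional distribution of $Z$ can thus be defined as:

$$p(Z \mid \pi) = \prod_{i=1}^{N} \prod_{j=1}^{M} \pi_j^{Z_{ij}}$$

Now, the conditional probability of the data, given $Z$ (class labels), is expressed as

$$p(\mathcal{Y} \mid Z, \theta) = \prod_{i=1}^{N} \prod_{j=1}^{M} p(\vec{Y}_i \mid \theta_j)^{Z_{ij}}$$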
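As a rough illustration of the mixture density described above, the following Python sketch evaluates the log-density of a Gamma mixture and the posterior probability that a sample belongs to each component. It assumes that $\beta$ acts as a rate parameter and that all coordinates are strictly positive (so that logarithms are defined). The function names are illustrative and do not come from the paper.

```python
import math


def gamma_log_pdf(y, alpha, beta):
    """Log-density of D independent Gamma(shape=alpha_d, rate=beta_d) coordinates."""
    return sum(
        a * math.log(b) - math.lgamma(a) + (a - 1.0) * math.log(x) - b * x
        for x, a, b in zip(y, alpha, beta)
    )


def _component_logs(y, weights, alphas, betas):
    # log(pi_j) + log p(y | theta_j) for every component j
    return [
        math.log(w) + gamma_log_pdf(y, a, b)
        for w, a, b in zip(weights, alphas, betas)
    ]


def mixture_log_pdf(y, weights, alphas, betas):
    """log sum_j pi_j p(y | theta_j), computed with the log-sum-exp trick."""
    logs = _component_logs(y, weights, alphas, betas)
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))


def responsibilities(y, weights, alphas, betas):
    """Posterior probability of each component for one sample (sums to 1)."""
    logs = _component_logs(y, weights, alphas, betas)
    m = max(logs)
    exps = [math.exp(v - m) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]
```

Working in the log domain avoids underflow when $D$ is large, since the product over dimensions can become extremely small.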

Finite Gamma Mixture Model with Local Model Selection
In this work, we address the problem of model selection in finite Gamma mixtures using a component-splitting approach, which has been successfully applied to Gaussian and Dirichlet mixtures in [9,24]. Indeed, this approach has the advantage of preventing over-fitting. The core idea of this algorithm is to partition the components, on the basis of a split criterion, into two different sets: free and fixed components. We constrain the algorithm to perform computations on only the free components and we assume that the fixed components already fit the dataset (i.e., the fixed components perfectly approximate the data). Let us denote by $s$ the number of free components and let the remaining $M - s$ be the fixed ones. Our framework is developed based on this local model selection design, and we can reformulate the prior distribution of $Z$ in Equation (5) as

$$p(Z \mid \pi, \pi^{*}) = \prod_{i=1}^{N} \left[ \prod_{j=1}^{s} \pi_j^{Z_{ij}} \prod_{j=s+1}^{M} \left(\pi_j^{*}\right)^{Z_{ij}} \right]$$

where $\{\pi_j\}$ and $\{\pi_j^{*}\}$ indicate the mixing coefficients of the free and fixed components, respectively. It is to be noted that $\{\pi_j\}$ and $\{\pi_j^{*}\}$ are positive and follow the constraint:

$$\sum_{j=1}^{s} \pi_j + \sum_{j=s+1}^{M} \pi_j^{*} = 1$$

Subsequently, we need to introduce a prior over the fixed mixing coefficients $\{\pi_j^{*}\}$, which are considered random variables. The goal here is to find the conditional probability of the fixed components, which depends only on the free mixing coefficients $\{\pi_j\}$. As introduced in [9], we choose a non-standard Dirichlet distribution as the prior for $\pi_j^{*}$.
Next, conjugate priors have to be determined for the model's parameters. Unfortunately, in our case, no tractable conjugate priors are available. Thus, based on the fact that our parameters are positive and statistically independent, the Gamma distribution is an appropriate choice to approximate these priors (for $\vec{\alpha}$ and $\vec{\beta}$). They can be expressed as: Finally, the joint distribution of all the random variables is determined as follows:

$$p(\mathcal{Y}, Z, \pi^{*}, \vec{\alpha}, \vec{\beta} \mid \pi) = p(\mathcal{Y} \mid Z, \vec{\alpha}, \vec{\beta})\, p(Z \mid \pi, \pi^{*})\, p(\pi^{*} \mid \pi)\, p(\vec{\alpha})\, p(\vec{\beta})$$

It is noteworthy that the free coefficients are considered here as parameters and not as random variables; therefore, we do not place a prior over $\pi$.

Model Learning Using Variational Bayes
For the parameter estimation problem, we focus here on variational Bayes with the mean-field approximation, which has been shown to be an efficient technique for inferring the posterior distributions of mixture models [20,25,26]. Indeed, variational Bayes approximates posteriors at a low computational cost, as opposed to other inference approaches such as the MCMC technique [8,27]. Since the true posterior $p(\Theta \mid \mathcal{Y})$ is intractable and cannot be calculated directly, the best methodology to follow is to find a good approximation of it, denoted by $Q(\Theta)$, that can be calculated easily [20]. Accordingly, we propose determining this approximation by maximizing the lower bound $\mathcal{L}(Q)$ of $\ln p(\mathcal{Y})$, given as follows:

$$\mathcal{L}(Q) = \int Q(\Theta) \ln \frac{p(\mathcal{Y}, \Theta)}{Q(\Theta)}\, d\Theta$$

where $\Theta = \{Z, \pi, \alpha, \beta\}$ includes both latent variables and random parameters. Next, we factorize the distribution $Q(\Theta)$ into disjoint tractable distributions by using the mean-field theory, as in [1]. This process leads to the following expression: Finally, the solutions of the updated variational posteriors are obtained by optimizing $\mathcal{L}(Q)$ with respect to each distribution. The resulting solutions are expressed as follows: where the hyperparameters in the above equations can be fixed in a similar way as in [26], by experimenting with different values depending on the data set to model.
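For readers less familiar with the variational setting, the following standard identities (general results, not specific to this paper) may help explain why maximizing the lower bound is equivalent to minimizing the KL divergence between $Q(\Theta)$ and the true posterior, and what the generic mean-field solution for one factor looks like. The factor index $s$ and the notation $\langle \cdot \rangle_{r \neq s}$ (expectation with respect to all factors except the $s$-th) are introduced here only for illustration.

```latex
% Decomposition of the log-evidence: the KL term is non-negative,
% so L(Q) is a lower bound that is tight when Q equals the posterior.
\ln p(\mathcal{Y}) = \mathcal{L}(Q) + \mathrm{KL}\big(Q \,\|\, P\big),
\qquad
\mathrm{KL}\big(Q \,\|\, P\big)
  = -\int Q(\Theta) \ln \frac{p(\Theta \mid \mathcal{Y})}{Q(\Theta)}\, d\Theta \;\ge\; 0

% Generic mean-field update for one factor Q_s(\Theta_s),
% assuming Q(\Theta) = \prod_s Q_s(\Theta_s):
Q_s(\Theta_s)
  = \frac{\exp \big\langle \ln p(\mathcal{Y}, \Theta) \big\rangle_{r \neq s}}
         {\int \exp \big\langle \ln p(\mathcal{Y}, \Theta) \big\rangle_{r \neq s}\, d\Theta_s}
```

Because each factor's update depends on expectations under the other factors, the updates are applied in turn until the lower bound stops increasing.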

Gamma Model Learning via Entropy-Based Component Splitting
In this section, we develop a robust variational learning approach that uses an entropy-based splitting method to learn the Gamma mixture model. We are motivated by the entropy principle suggested in [23] for learning Gaussian mixtures. The core idea is to evaluate how well each component of the Gamma mixture model fits the data, by comparing the theoretical entropy with the estimated entropy. In particular, we compute an estimate of the entropy using the MeanNN estimator [28] and then compare it with the theoretical maximum entropy to check whether a component is truly Gamma distributed. In the case of a significant difference (greater than $10^{-2}$), we conclude that this component does not fit well, and we proceed with a partitioning process that divides the current component into two new clusters. As a result, the proposed entropy-based learning approach for Gamma mixtures allows us to assess the number of components accurately (i.e., to define the model complexity).

Theoretical Entropy of Gamma Mixtures
Let us denote by $\vec{Y}$ a continuous random variable and by $p(\vec{Y})$ its probability density function; then the differential entropy of $\vec{Y}$ is given, as in [29], by:

$$H(\vec{Y}) = -\int p(\vec{Y}) \ln p(\vec{Y})\, d\vec{Y}$$

In our case, $\vec{Y}$ is supposed to be distributed according to a Gamma distribution (given in Equation (1)). After simplification, we obtain the following theoretical value of the maximum differential entropy of $\vec{Y}$:

$$H_{Ga}(\vec{Y}) = \sum_{d=1}^{D} \left[ \alpha_d - \ln \beta_d + \ln \Gamma(\alpha_d) + (1 - \alpha_d)\, \psi(\alpha_d) \right]$$

where $\psi$ is the digamma function, $\psi(x) = \frac{d}{dx} \ln \Gamma(x)$.
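The closed-form entropy of a Gamma component can be checked numerically. The sketch below is a minimal illustration that assumes the rate parameterization, under which each dimension contributes $\alpha - \ln \beta + \ln \Gamma(\alpha) + (1 - \alpha)\psi(\alpha)$. Since the Python standard library has no digamma function, a small approximation (recurrence plus asymptotic series) is included; both function names are illustrative.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        # Recurrence psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv
    result -= inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def gamma_entropy(alpha, beta):
    """Differential entropy of independent Gamma(shape, rate) coordinates."""
    return sum(
        a - math.log(b) + math.lgamma(a) + (1.0 - a) * digamma(a)
        for a, b in zip(alpha, beta)
    )
```

A quick sanity check is the exponential distribution (shape 1, rate 1), whose differential entropy is exactly 1.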

MeanNN Entropy Estimator
In order to assess whether a given component is truly distributed according to a Gamma distribution, we use the MeanNN entropy estimator proposed in [28]. It estimates the Shannon differential entropy $H(\vec{Y})$, given in Equation (17), of a $D$-dimensional random variable $\vec{Y}_i$ with an unknown density function $p(\vec{Y}_i)$ [30]. By estimating $\log p(\vec{Y}_i)$, we can determine an unbiased entropy estimator. We follow the key idea in [28], where $\epsilon$ is the diameter of a ball centered at $\vec{Y}_i$. We suppose that there exists a point within the distance $[\epsilon, \epsilon + d\epsilon]$ of $\vec{Y}_i$, so that $k - 1$ other points lie at smaller distances from $\vec{Y}_i$ and $N - k - 1$ points lie at larger distances. Based on this paradigm, the probability density of the distance between $\vec{Y}_i$ and its $k$-th nearest neighbor is given as:

$$p_{ik}(\epsilon) = \frac{(N-1)!}{(k-1)!\,(N-k-1)!}\; \frac{d p_i(\epsilon)}{d\epsilon}\; p_i(\epsilon)^{k-1} \left(1 - p_i(\epsilon)\right)^{N-k-1}$$

where $p_i(\epsilon)$ represents the mass of the $\epsilon$-ball $B_\epsilon(\vec{Y}_i)$ centered at $\vec{Y}_i$:

$$p_i(\epsilon) = \int_{B_\epsilon(\vec{Y}_i)} p(\vec{Y})\, d\vec{Y}$$

The expected value of $\log p_i(\epsilon)$ with respect to $p_{ik}(\epsilon)$ is given by:

$$E\left[\log p_i(\epsilon)\right] = \int_{0}^{\infty} p_{ik}(\epsilon) \log p_i(\epsilon)\, d\epsilon = \psi(k) - \psi(N)$$

In the whole $\epsilon$-ball, $p(\vec{Y}_i)$ is supposed to be constant. So, we have:

$$p_i(\epsilon) \approx V_d\, \epsilon^{d}\, p(\vec{Y}_i)$$

where $d$ is the dimension of $\vec{Y}_i$ and $V_d$ denotes the unit ball volume. Substituting this approximation (Equation (22)) into the expectation (Equation (21)) leads to the unbiased kNN estimator of the differential entropy:

$$H_k(\vec{Y}) = \psi(N) - \psi(k) + \log V_d + \frac{d}{N} \sum_{i=1}^{N} \log \epsilon_{i,k}$$

Based on the assumption in [23], the differential entropy can be obtained from the mean of many estimators corresponding to different values of $k$. Thus, if we consider all values of $k$ (i.e., from 1 to $N - 1$), we obtain the following estimate of the differential entropy:

$$H_M(\vec{Y}) = \frac{1}{N-1} \sum_{k=1}^{N-1} H_k(\vec{Y}) = \log V_d + \psi(N) + \frac{1}{N-1} \sum_{k=1}^{N-1} \left[ \frac{d}{N} \sum_{i=1}^{N} \log \epsilon_{i,k} - \psi(k) \right]$$

where $\epsilon_{i,k}$ is the distance from $\vec{Y}_i$ to its $k$-th nearest neighbor. The maximum entropy of our Gamma mixture model can be expressed by

$$H_{\max}(\vec{Y}) = \sum_{j=1}^{M} \pi_j\, H_{Ga}(j)$$

where $H_{Ga}(j)$ is the maximum differential entropy of the $j$-th cluster. Thereafter, we can assess the quality of fit of the developed model within each cluster by comparing the entropies mentioned above.
Indeed, we denote by $\Omega_{Ga}$ the normalized, weighted sum of the differences between the two entropies, as in [23]. This quantity is evaluated for each component of our model (GaMM) as: where $H_M(j)$ is the entropy of component $j$ computed by the MeanNN estimator. Thus, $\Omega_{Ga}$ lies in the interval $[0, 1]$. If the observed dataset is truly Gamma distributed, then the value of $\Omega_{Ga}$ approaches zero. The splitting process is based on selecting the component $j^{*}$ with the highest $\Omega_{Ga}(j)$, as follows:

$$j^{*} = \arg\max_{j}\; \Omega_{Ga}(j)$$

Thus, we inspect $\Omega_{Ga}$ by comparing the theoretical and estimated entropies of the Gamma mixture, and then we split $j^{*}$ into two new components.
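To make the averaging over all values of $k$ concrete, the sketch below computes a MeanNN-style entropy estimate in plain Python. It relies on the observation that summing $\log \epsilon_{i,k}$ over every $k$ for a given point is the same as summing the log-distances from that point to all other points, so no neighbor sorting is needed. It assumes Euclidean distances, pairwise distinct points, and that $\epsilon_{i,k}$ is measured as the distance to the $k$-th neighbor; the function names are illustrative rather than taken from [28].

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """psi(n) for a positive integer n, via psi(n) = -gamma + sum_{m<n} 1/m."""
    return -EULER_GAMMA + sum(1.0 / m for m in range(1, n))


def meannn_entropy(points):
    """Average of the kNN entropy estimators over k = 1, ..., N - 1."""
    n = len(points)
    d = len(points[0])
    # Log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1)
    log_unit_ball = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)
    mean_psi_k = sum(digamma_int(k) for k in range(1, n)) / (n - 1)
    const = log_unit_ball + digamma_int(n) - mean_psi_k
    # Sum of log-distances over unordered pairs (requires distinct points)
    log_dist_sum = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            log_dist_sum += math.log(math.dist(points[i], points[j]))
    # Each unordered pair is counted twice in the sum over i and k
    return const + d * 2.0 * log_dist_sum / (n * (n - 1))
```

The cost is quadratic in the number of points, which is acceptable when the estimator is applied per component rather than to the full dataset at once.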
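The selection of the component to split can be sketched as follows. Because the exact expression of $\Omega_{Ga}(j)$ is not reproduced here, the code assumes one plausible form consistent with the description (a difference of entropies, normalized by the theoretical entropy and weighted by the mixing coefficient). This form is an assumption made for illustration and should be replaced by the paper's own definition; the function names are also hypothetical.

```python
def fit_scores(weights, h_theory, h_estimated):
    """Assumed score per component: pi_j * |H_Ga(j) - H_M(j)| / |H_Ga(j)|."""
    scores = []
    for w, ht, he in zip(weights, h_theory, h_estimated):
        if ht == 0.0:
            raise ValueError("theoretical entropy must be non-zero to normalize")
        scores.append(w * abs(ht - he) / abs(ht))
    return scores


def select_component_to_split(weights, h_theory, h_estimated):
    """Return the index j* with the highest score, together with all scores."""
    scores = fit_scores(weights, h_theory, h_estimated)
    j_star = max(range(len(scores)), key=scores.__getitem__)
    return j_star, scores
```

A component whose estimated entropy matches its theoretical entropy receives a score of zero and is therefore never preferred for splitting.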

Variational Learning Algorithm via Entropy-Based Splitting
The proposed variational inference algorithm for Gamma mixture models is summarized in Algorithm 1. It is noteworthy that there are two scenarios for our algorithm. In the first one, all components are kept (i.e., all mixing coefficients are different from zero). In this case, the splitting process is performed successfully and the number of components increases by one ($M + 1$). The component selected to be split into two new clusters is the one that has the largest $\Omega_{Ga}(j)$. The second scenario occurs when one of the mixing coefficients is near zero. In this case, its associated component is deleted ($M - 1$), the splitting process is not performed, and the algorithm stops with the remaining clusters. Note that we start with one component ($M = 1$).

Algorithm 1: Proposed Entropy-based Variational Learning for GaMM.
(1) Initialization
    • Initialize hyperparameters u, v, g, h, a_0.
(2) Splitting process
    • Split j* into two new components j_1 and j_2 with equal proportions π_{j*}/2.
    • M = M + 1.
    • Initialise the parameters of j_1 and j_2 using the same parameters as j*.
(3) Perform standard variational Bayes until convergence.
    • Evaluate Ω_Ga, choose j* according to Equation (29), and go to the splitting process in step (2).
end
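The control flow of Algorithm 1 can be sketched in Python as shown below. This is only an outline under several assumptions: `fit_vb` and `score` are hypothetical callbacks standing in for the variational updates and for the evaluation of $\Omega_{Ga}$, the tolerance `tol` for a "near zero" mixing coefficient and the cap `max_components` are illustrative choices, and the loop fits the model before the first split.

```python
import copy


def split_component(weights, params, j_star):
    """Replace component j_star by two copies, each with half its weight."""
    half = weights[j_star] / 2.0
    new_weights = weights[:j_star] + [half, half] + weights[j_star + 1:]
    twins = [copy.deepcopy(params[j_star]), copy.deepcopy(params[j_star])]
    new_params = params[:j_star] + twins + params[j_star + 1:]
    return new_weights, new_params


def entropy_split_loop(weights, params, fit_vb, score, max_components=20, tol=1e-5):
    """Alternate variational fitting and splitting until a weight vanishes."""
    while True:
        # Hypothetical callback: runs variational Bayes until convergence
        weights, params = fit_vb(weights, params)
        keep = [j for j, w in enumerate(weights) if w >= tol]
        if len(keep) < len(weights):
            # Second scenario: drop vanishing components and stop
            total = sum(weights[j] for j in keep)
            return [weights[j] / total for j in keep], [params[j] for j in keep]
        if len(weights) >= max_components:
            return weights, params
        # First scenario: split the worst-fitting component
        scores = score(weights, params)
        j_star = max(range(len(scores)), key=scores.__getitem__)
        weights, params = split_component(weights, params, j_star)
```

Duplicating the parent's parameters and halving its weight leaves the mixture density unchanged at the moment of the split, so the subsequent variational updates start from the previously converged solution.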

Dynamic Texture Clustering
Dynamic textures (DT) are defined by Doretto et al. [31] as an extension of texture to the temporal domain. In other words, it is a sequence of images of moving scenes that display specific stationary properties in time (e.g., smoke, clouds, sea waves and trees). In such case, the spatial (i.e., appearance) and temporal (i.e., motion) characteristics may not be the same. DT plays a substantial role in many applications and the modeling of DT has been addressed by many researchers to solve different problems, including motion synthesis or retrieval, motion classification, recognition and segmentation [32]. Thus, new concepts that can be derived from static texture approaches are needed to integrate the analysis of temporal variations into the spatial analysis. However, the main issues encountered in dynamic textures analysis arise from the large range of appearances and the association of both temporal and spatial properties.
In order to apply the proposed entropy-based learning model to clustering dynamic textures, some preprocessing steps are performed. First, we extract spatial visual features with the scale-invariant feature transform (SIFT) [33], which is widely used in such contexts. However, SIFT alone is insufficient to encode the full dynamic texture, including time information. Therefore, we also take into account temporal descriptors, such as the so-called space-time interest points (STIPs) [34]. As a result, we calculate 128-dimensional SIFT/STIPs descriptors from each frame of every video through the difference-of-Gaussians (DoG) detector. This step allows the selection of potential interest points for which both rotation and scale invariance are ensured. Next, these features are combined with features extracted using an ImageNet-trained deep learning model and, finally, normalized and modeled using the developed approach (GaM-En). It is noted that we do not need to consider the class labels because our aim is to perform clustering analysis in an unsupervised manner.
In this experiment, a challenging dynamic texture dataset, the DynTex database [35], is used to evaluate performance. It consists of more than 650 dynamic texture video sequences in PAL format (720 × 576, 25 fps). We limit our work to a subset of videos representing 10 different categories, including flags, sea, vegetation, calm water, trees, smoke, fountains, traffic and rotation. Every category contains 20 videos. Some samples from these categories are depicted in Figure 1. The obtained results are reported in Table 1. From these results, GaM-En reaches an accuracy of 93.40%, whereas the accuracies of the other methods are below 88%, which confirms that our model provides better performance. This demonstrates a significant improvement when using the Gamma distribution and entropy-based variational learning over Gaussian-based models to distinguish dynamic texture categories.

Human Gesture Recognition
Recognizing human gestures has become an important and active research direction in computer vision and pattern recognition, with many potential applications, such as human-computer interaction, artificial intelligence, video surveillance systems, virtual reality, etc. Indeed, human gestures (or actions) are a natural way of expressing intentions in people's daily lives. The use of gestures can help people with certain disabilities to communicate with others. In particular, hand gesture recognition helps in understanding the movement of a hand. Recently, this research field has been gaining increasing attention and, so far, many research works have been conducted on human gesture recognition [16,36,37,38]. Nevertheless, it still remains a challenging research field, primarily due to the complexity and ambiguity of human motion and of backgrounds. The goal of this experiment is to evaluate our proposed statistical approach (GaM-En) on two types of human gesture recognition: hand and body gestures. We proceed as in [39] in order to obtain discriminative features for gesture detection in the spatiotemporal domain from each video. Indeed, both motion and appearance features are extracted to characterize human gestures. For motion features, we use the so-called motion history image (MHI) [40]. Then, the histogram of oriented gradients (HOG) [41] is adopted to extract appearance features, which take into account edge magnitude, direction and corner information. Finally, we apply the bag-of-visual-words model to quantize the resulting discriminative vectors via the K-means algorithm. As a result, a histogram vector (representing the frequency of each visual word) is constructed to model each input frame. After this preprocessing step, we apply our proposed statistical model (GaM-En) to recognize human gestures.
In particular, each test video is assigned to the appropriate category with the maximum posterior probability under Bayes' rule.
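The assignment rule above amounts to picking the class whose log-prior plus log-likelihood is largest. A minimal sketch follows; it assumes that a per-class log-likelihood of the test video under the corresponding learned mixture is already available, and the class names and function name are purely illustrative.

```python
import math


def classify_map(log_likelihoods, priors):
    """Return the class maximizing log p(class) + log p(data | class)."""
    scores = {
        label: math.log(priors[label]) + log_lik
        for label, log_lik in log_likelihoods.items()
    }
    return max(scores, key=scores.get)


# Usage with made-up numbers: equal priors, so the likelihood decides.
label = classify_map({"flat": -10.0, "spread": -9.5}, {"flat": 0.5, "spread": 0.5})
```

The normalizing constant of Bayes' rule is the same for every class, so it can be omitted when only the maximizing class is needed.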
In this experiment (hand gesture recognition), we consider the Cambridge-Gesture database [42] as a public database. It includes nine hundred (900) image sequences representing nine different classes of hand gesture data. These classes are composed of three primitive hand shapes ('Flat', 'Spread' and 'Vshape') and three primitive motions (leftward, rightward and contracting). In every class there are 100 sequences captured with different illuminations and arbitrary motions, and the size of each image is 320 × 240 pixels. In our case, the dataset is divided into two equal partitions: one is used for training and the other is for testing. Sample hand gesture frames from this database can be viewed in Figure 2.
We also conduct other experiments on human body gesture recognition and we test our approach using the publicly available UMD Keck body-gesture dataset (http://www.umiacs.umd.edu/~zhuolin/Keckgesturedataset.html, accessed on 5 December 2021) [43].
In order to demonstrate the benefits of using Gamma models with entropy-based variational learning and component splitting for both body and hand gesture recognition, we calculate the confusion matrix for the UMD Keck body-gesture database. Furthermore, we compare the performance on the Cambridge-Gesture dataset through the overall recognition accuracy. This is performed for our approach (GaM-En) and three other mixture methods: the Gaussian mixture model with component splitting (GM-Split), the Gaussian mixture model via entropy-based learning (GM-En), and the Gamma mixture model via variational learning (GaM-VB). Table 2 reports the average accuracy and processing time obtained by running each approach 30 times. Based on this comparative study, we can see clearly that our model has a higher overall recognition accuracy (91.66%) than the others, whose accuracies are below 87%. Moreover, the shortest processing time required to reach the optimal solution is obtained with the proposed GaM-En. These results again support the effectiveness of our entropy-based framework for recognizing human gestures.

Object Categorization
Our last experiment involves the application of object categorization. Indeed, the detection of real-world objects has been an important application of computer vision due to the increasingly huge number of images created every day [44,45]. The goal of object categorization is to differentiate the classes of objects from each other. This problem is considered to be difficult due to the changes in viewpoint and illumination conditions that can drastically modify a particular object's appearance. Several research works have tackled the problem of modeling and categorizing objects because solving it will help further tasks in pattern recognition and computer vision applications, such as image classification and retrieval. We address this challenging problem here and evaluate the performance of our framework by comparing it with other methods. In particular, our aim is to test the effectiveness of our statistical model in terms of clustering the input from a set of images.
In this section, we evaluate our framework on the basis of two challenging databases, Caltech256 [46] and GHIM10K (http://www.ci.gxnu.edu.cn/cbir/dataset.aspx, accessed on 1 August 2021). From Caltech256, we use 600 images divided into four categories: Faces, Planes, Bikes and Camels. From GHIM10K, we use 400 images divided into four classes, which are Flowers, Boats, Cars and Bugs. Each class consists of 100 images. To make the problem more challenging, the objects are acquired with different lighting, from different angles and against different background conditions. Samples from these two databases are presented in Figures 4 and 5. Generally, when addressing the problem of object categorization, the first step is to extract robust descriptors from the input data. Thus, a preprocessing step was adopted here to extract visual features using SIFT (scale-invariant feature transform). All extracted local SIFT descriptors are grouped into a collection (corpus). Then, K-means is applied to cluster the corpus and generate a vocabulary of visual words. In this experiment, the optimal number of vocabulary words is 50. In order to demonstrate the merits of the proposed framework for object categorization, we also evaluate other generative model-based methods, such as the Gaussian mixture model with component splitting (GM-Split), the Gaussian mixture model via entropy-based learning (GM-En), and the Gamma mixture model via variational learning (GaM-VB). Furthermore, we compare the performance and report the average results from 30 runs in terms of overall categorization accuracy in Table 3. To initialize the model's parameters, different parameter settings are considered to ensure the robustness of our choice. As illustrated in Table 3, GaM-En differentiates objects from the Caltech256 and GHIM10K datasets with the highest accuracy rates: 97.84% and 97.02%, respectively.
Lower rates of categorization accuracy are obtained by the Gaussian-based models (GM-En and GM-Split). These results demonstrate that the entropy-based Gamma model offers better modeling capabilities than Gaussian-based models when dealing with compositional feature vectors. On the other hand, it is clear from the same table that entropy-based variational learning (GaM-En) outperforms conventional variational learning (GaM-VB) in learning Gamma mixture models.

Table 3. Results of object categorization using different models (average % ± standard error (average time in s)).


Conclusions
This paper has presented a novel entropy-based variational approach with a splitting method to learn the parameters of Gamma mixture models. The main goal is to investigate entropy criteria in order to evaluate whether a given component is truly Gamma distributed. This process is performed by comparing the theoretical maximum entropy with that calculated by the MeanNN estimator. Subsequently, when the difference is significant, the component with the highest difference is split into two new components (or clusters), since it is not well fitted by the mixture model. Our developed framework (GaM-En) leads to a principled solution and has the advantage of avoiding over- and under-fitting issues. Through extensive experimentation, including examining the problems of dynamic texture clustering, human gesture recognition and object categorization, we have validated our framework. The obtained results show that our approach is competitive and outperforms some state-of-the-art methods, thanks to its flexibility and effectiveness in terms of multidimensional data modelling and learning. The developed approach has an attractive simplicity and generality that make it easy to apply to many other challenging problems, including text clustering and medical image analysis. To improve the results further, a promising direction for future work is the integration of a visual feature selection mechanism into the current framework. We also plan to deal with dynamic data by developing an online learning process, instead of batch learning.